A Market Analysis for Beers and Breweries in the USA. MidTerm Project by: [ Mani Sowmya & George Ndede ].

With the recent rise in popularity of craft beer, a new generation of beer pioneers has been created that no longer is satisfied with the standard beer styles offered by big commercial breweries, such as ABInbev. Among the most popular of the craft beer styles is the Imperial Pale Ale, often referred to a the “IPA”. It has become a major source of revenue for many craft breweries as the trend has sweeped the globe. In order to compete, AbInbev has been discussing the potential of releasing a “Budweiser IPA”, which will be a market response to promote Budweiser to a wider audience. The following exploratory analysis aims to provide a thorough understanding of what defines an IPA, along with the best suggestion for a regional trial.

All supporting discussion about code will be in blue italics.

library(dplyr)
library(here)
library(ggplot2)
library(tidyr)
library(ggthemes)
library(doBy)
library(reshape)
library(plotly)
library(GGally)
library(caret)
library(e1071)
library(class)
library(usmap)
library(randomForest)

First, we create some custom theme to ensure homogeneity throughout the presentation deck.

# set global theme
# custom theme
bg_color = "#999495"
bar_color = "#6539b8"
text_main = "#FFFFF"
text_ticks = "#CFC6C7"
axis_lines = #756f70"
cust_theme <- theme(plot.title = element_text(color = 'white', vjust=0.5, hjust=0.5),
                    plot.background = element_rect(fill = bg_color),
                    panel.background = element_rect(fill = bg_color),
                    axis.text.x=element_text(angle=90,hjust=1),
                    axis.text = element_text(colour = text_ticks),
                    panel.grid.major.x = element_blank(),
                    panel.grid.minor.x = element_blank(),
                    panel.grid.major.y =  element_line(colour = "#453a3b", linetype = 1, size = 0.25),
                    panel.grid.minor.y = element_blank(),
                    axis.title = element_text(colour = "yellow"))

# overwrite default theme with custom
theme_set(theme_foundation() + cust_theme)

Then, we created a relative file path from the “here” library for better reproducibility. After setting our working directory, we imported the files into data frames for better transformation capabilities.

# set relative brewery file path as variable
beers_csv <- here("project_data_files", "Beers.csv")
breweries_csv <- here("project_data_files", "Breweries.csv")

# initializing the dataframes for import
beers <- data.frame(read.csv(beers_csv))
breweries <- data.frame(read.csv(breweries_csv))

Part 1: How many Breweries are in each State.

Grouped by States in order to count the total number of Breweries then plot results as a bar chart using ggplot (and plotly).

According the data set provided, it appears that the states with the highest craft brewery count are Colorado, California, Michigan, Oregon, and Texas. We can make an deduce that these markets are most appreciative towards craft beer and would provide a great test market for the new IPA release.

# Question 1 - How many breweries are present in each state?
per_state <- breweries %>% group_by(State) %>% tally(name="breweries_per_state") %>% arrange(desc(breweries_per_state))
Diagram1 <- per_state %>% ggplot(aes(x=reorder(State, breweries_per_state), y=breweries_per_state)) + geom_bar(stat="identity", fill=bar_color) + ggtitle("Total Breweries per State") + ylab("Total Breweries") + xlab("State Names")

ggplotly(Diagram1)

This is a Choropleth of the brewery count per state to provide geographic representation of the bar chart above.

# this is a choropleth of breweries per state
per_state$State <- trimws(per_state$State)
per_state$hover  <- with(per_state, paste(State, '<br>', "Breweries", breweries_per_state))
Diagram1_2 <- plot_geo(per_state, locationmode = 'USA-states') %>%
  add_trace(
    z = ~breweries_per_state, text = ~hover, locations = ~State,
    color = ~breweries_per_state, colors = 'Purples'
  ) %>%
  colorbar(title = "Breweries Per State") %>%
  layout(
    title = 'Breweries per State in the Map<br>(Mouse over for detailed breakdown)',
    geo = list(
      scope = 'usa',
      projection = list(type = 'albers usa'),
      showlakes = TRUE,
      lakecolor = toRGB('white')
    )
  )

ggplotly(Diagram1_2)

Part 2: Merging Beer Data with Breweries Data.

We merge the two data sets (Beer and Breweries). In order to join the two data frames, we use the merge() function and joined the two data frames on common keys - brewery_id and brew_id. We also ensure to rename the columns for easier reuse and then displayed the results using an rbind() function to combine the heads and tails of the outcome data frame.

For our work to be easier, we made sure we made sure we the Beer and Breweries data set together.

# Question 2: Merge Beer data with Breweries data. Print head(6) and tail(6)
# Note: left join breweries on beers because 1-many relationship
main <- merge(beers, breweries, by.x="Brewery_id", by.y="Brew_ID")

# clean column names for good housekeeping
main_cols <- c("brewery_id", "beer_name", "beer_id", "abv", "ibu", "beer_style", "serving_ounces", "brewery_name", "city", "state")
names(main) <- main_cols

# print head and tail of resultant data set
print(rbind(head(main, 6), tail(main, 6)))
##      brewery_id                 beer_name beer_id   abv ibu
## 1             1              Get Together    2692 0.045  50
## 2             1             Maggie's Leap    2691 0.049  26
## 3             1                Wall's End    2690 0.048  19
## 4             1                   Pumpion    2689 0.060  38
## 5             1                Stronghold    2688 0.060  25
## 6             1               Parapet ESB    2687 0.056  47
## 2405        556             Pilsner Ukiah      98 0.055  NA
## 2406        557  Heinnieweisse Weissebier      52 0.049  NA
## 2407        557           Snapperhead IPA      51 0.068  NA
## 2408        557         Moo Thunder Stout      50 0.049  NA
## 2409        557         Porkslap Pale Ale      49 0.043  NA
## 2410        558 Urban Wilderness Pale Ale      30 0.049  NA
##                               beer_style serving_ounces
## 1                           American IPA             16
## 2                     Milk / Sweet Stout             16
## 3                      English Brown Ale             16
## 4                            Pumpkin Ale             16
## 5                        American Porter             16
## 6    Extra Special / Strong Bitter (ESB)             16
## 2405                     German Pilsener             12
## 2406                          Hefeweizen             12
## 2407                        American IPA             12
## 2408                  Milk / Sweet Stout             12
## 2409             American Pale Ale (APA)             12
## 2410                    English Pale Ale             12
##                       brewery_name          city state
## 1               NorthGate Brewing    Minneapolis    MN
## 2               NorthGate Brewing    Minneapolis    MN
## 3               NorthGate Brewing    Minneapolis    MN
## 4               NorthGate Brewing    Minneapolis    MN
## 5               NorthGate Brewing    Minneapolis    MN
## 6               NorthGate Brewing    Minneapolis    MN
## 2405         Ukiah Brewing Company         Ukiah    CA
## 2406       Butternuts Beer and Ale Garrattsville    NY
## 2407       Butternuts Beer and Ale Garrattsville    NY
## 2408       Butternuts Beer and Ale Garrattsville    NY
## 2409       Butternuts Beer and Ale Garrattsville    NY
## 2410 Sleeping Lady Brewing Company     Anchorage    AK

Part 3: Addressing the missing values in each column.

To address the missing values in each colum, we filter the data frame to only null values using is.na() function, and then took the count of each column using colSums() function. We see in the output that the most null value is in the International Bitterness Unit (IBU). Almost half of the IBU values are missing! To make it possible to fill in missing values, we made an assumption that the median values for each style should be fairly close to the actual values. The assumption is that each beer style has ranges on IBU, therefore, we know a good range of potential values could lie in the median.

In order to use the dataset for future statistical modeling and machine learning projects, it’s important to treat any missing values in the dataset. To do this, we will follow two common approaches:

  1. remove records with null values
  2. replace null values with median_taste from each beer style
# Part 3: Address NA values in each column
# find the NA count per column to decide next steps...you will see some missing abv and A LOT of missing ibu
print(colSums(is.na(main)))
##     brewery_id      beer_name        beer_id            abv            ibu 
##              0              0              0             62           1005 
##     beer_style serving_ounces   brewery_name           city          state 
##              0              0              0              0              0
# Preserve a version of merged dataset by removing all NAs
main_clean<-na.omit(main)

Note: Because we were supposed to use approach #1, we did not use the corrected values for any of the machine learning analysis. We just wanted to provide these details to show more depth in our approach and provide an additional method that we could have followed

You can see how by providing a median value, we were able to reduce the missing IBU values to 52 (see ibu_corr) and the missing ABV values to 0 (see abv_corr).

# based on evidence, will use median abv and ibu per beer style to fill na
main <- main %>% group_by(beer_style) %>% mutate(ibu_corr = ifelse(is.na(ibu), median(ibu, na.rm = TRUE), ibu), abv_corr = ifelse(is.na(abv), median(abv, na.rm = TRUE), abv))

# let's see how we did...you will see all abv "corrected" and IBU had over 950 values "corrected"
print(colSums(is.na(main)))
##     brewery_id      beer_name        beer_id            abv            ibu 
##              0              0              0             62           1005 
##     beer_style serving_ounces   brewery_name           city          state 
##              0              0              0              0              0 
##       ibu_corr       abv_corr 
##             52              0

The results of this process is written to a CSV for future retrieval.

# Export the clean n0-NAs file to a new CSV file
write.csv(main_clean,"./BreweryBeerCleanDS.csv", row.names = FALSE)

Part 4: Computing Median Alcohol By Volume (ABV) and International Bitterness Units (IBU) for each State.

After preparing a file without null values, we now compute the median (ABV) and (IBU) for the data set to give us a clear picture of the central values to expect from all the beers. We accomplish this using a grouping() function in the dplyr’s package in order to perform a median aggregation on each column.

We answer the questions: i) What ABV and IBU levels do craft beer consumers gravitate towards?
ii) Where does traditional Budweiser fall on this scale?
iii) Based on the evidence in the data set provided, the median ABV of the craft beer for each state provided is between 4.0 - 6.7%. This is a narrow spread, suggesting that Alcohol by Volume level between the median ranges would likely be suitable to consumers in all the United States. While looking at the median IBU values, there is a wider spread between states. What this would suggest is that different states have different views on how bitter their beer should be (IBU level). Budweiser is found right in the middle of the ABV range at 5% ABV, but much lower than the median IBU for all states in the data set of (12 IBU)! Conclusion, the evidence suggests that craft beer consumers might prefer Budweiser if it were more bitter (greater IBU) than it is.

# Part 4: Compute Median ABV and IBU and do bar plot
median_taste <- main_clean %>% group_by(state) %>% summarise(median_abv = median(abv), median_ibu = median(ibu))

#Bar_Chart_Plotter

# IBU bar plot
Diagram2 <- median_taste %>% ggplot() + geom_bar(aes(x=reorder(state, -median_ibu), y=median_ibu), stat="identity", fill=bar_color) + ggtitle("Median IBU per State") +ylab("Median IBU") + xlab("State") 

# ABV bar plot
Diagram3 <- median_taste %>% ggplot() + geom_bar(aes(x=reorder(state, -median_abv), y=median_abv*100), stat="identity", fill=bar_color) + ggtitle("Median ABV per State") +ylab("Median ABV") + xlab("State") + expand_limits(y=c(0, max(median_taste$median_abv+0.5)))

ggplotly(Diagram2)
ggplotly(Diagram3)

Below are Choropleths of median IBU and ABV per state to provide a geographic analysis of the bar charts above.

# Map for Median IBU per State

median_taste$state <- trimws(median_taste$state)
median_taste$hover  <- with(median_taste, paste(state, '<br>', "IBU", median_ibu))
Diagram2_2 <- plot_geo(median_taste, locationmode = 'USA-states') %>%
  add_trace(
    z = ~median_ibu, text = ~hover, locations = ~state,
    color = ~median_ibu, colors = 'Greens'
  ) %>%
  colorbar(title = "Median IBU per State") %>%
  layout(
    title = 'Median IBU per State in Map<br>(Hover for breakdown)',
    geo = list(
      scope = 'usa',
      projection = list(type = 'albers usa'),
      showlakes = TRUE,
      lakecolor = toRGB('white')
    )
  )

ggplotly(Diagram2_2)
# Map for Median ABV per State

median_taste$state <- trimws(median_taste$state)
median_taste$hover  <- with(median_taste, paste(state, '<br>', "ABV", median_abv))
Diagram3_2 <- plot_geo(median_taste, locationmode = 'USA-states') %>%
  add_trace(
    z = ~median_abv, text = ~hover, locations = ~state,
    color = ~median_abv, colors = 'Reds'
  ) %>%
  colorbar(title = "Median Alcohol by Volume per State") %>%
  layout(
    title = 'Median Alcohol by Volume per State in Map<br>(Point Mouse on a State to see details)',
    geo = list(
      scope = 'usa',
      projection = list(type = 'albers usa'),
      showlakes = TRUE,
      lakecolor = toRGB('white')
    )
  )

ggplotly(Diagram3_2)

Part 5: Which State has the Maximum Alcohol By Volume (ABV) & Which State has Maximum International Bitternes Units (IBU).

In order to get an idea of the maximum values in the dataset, we perform a simple lookup of maximum ABC and IBU for the dataset. We do this by utilizing “which” to look up the index of the max value specified. From there, we filter our dataset to only include that particular index.

What are the limits for ABV and IBU levels of craft beer? How high are consumers willing to go? Based on the evidence provided in the dataset, it appears that the maximum ABV among the craft beers is 12.8% (~2.5x Budweiser) with a maximum IBU of 138 (11.5x Budweiser).

#Part5: The State with max ABV and max IBU
#The item with max ABV
beer_MaxAbv <- main_clean[which.max(main_clean$abv),]
print(paste0("The state has maximum alcoholic beer is:", beer_MaxAbv$state, " with ABV of ", beer_MaxAbv$abv))
## [1] "The state has maximum alcoholic beer is: KY with ABV of 0.125"
#The item with max IBU
beer_MaxIbu <- main_clean[which.max(main_clean$ibu),]
print(paste0("The state has maximum bitterness beer is:", beer_MaxIbu$state, " with IBU of ", beer_MaxIbu$ibu))
## [1] "The state has maximum bitterness beer is: OR with IBU of 138"

Part 6: Comment on Summary of Statistics and Distribution of ABV.

To get an idea of the overall statistical distribution of the abv data, we then used “summary” to return a statistical summary of the abv data and then create a histogram to show normality and spread.

Looking further into the distribution of the ABV levels of craft beers, it appears that the average ABV of craft beer is 5.97%, with a slight right skew. According to the median, 50% of craft beers are between 0-5.6% ABV and another 50% are between 5.6 and 12.8%.

#Part6: The summary statistics and distribution of the ABV
summary(main_clean$abv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02700 0.05000 0.05700 0.05991 0.06800 0.12500
Diagram4 <- main_clean %>% ggplot() + geom_histogram(aes(x=abv*100), fill=bar_color) + ggtitle("Distribution of ABV") + xlab("Percent ABV")

ggplotly(Diagram4)

Part 7: Relationship between the International Bitternes Units (IBU) of Beer and Alcohol By Volume (ABV).

Two of the most principle characteristics of beer are the International Bitternes Units and abv measurements. What’s the relationship between these two? In order to test for this, we created a scatterplot of Alcohol By Volume vs. International Bitternes Units. You can see that the positive correlation between the two suggests that higher alcohol content is common among higher Alcohol By Volume beers.

#Part7: Relationship between IBU and ABV?
#Scatterplot
Diagram5 <- main_clean %>% ggplot(aes(ibu, abv*100)) + geom_point(colour=bar_color, na.rm=TRUE)+geom_smooth(method=lm, se=FALSE,color="white", linetype="dashed", na.rm=TRUE) + ggtitle("ABV vs. IBU") + ylab("abv")

ggplotly(Diagram5)
## `geom_smooth()` using formula 'y ~ x'

Part 8: KNN model and Machine Learning Based on IBU and ABV.

The first steps are preparing the data by splitting and classifying accordingly.

Next, we will create an estimation model in order to classify beers (IPA vs. Non-IPA) based on their alchol and IBU levels. The method used here will be K-nearest neighbors, which will leverage classification of known beers to compare the test value to the known. This could help us in our investigation by assisting in labeling beer styles of various other samples in addition to testing if our new Budweiser IPA will meet similar ABV and IBU to the supplied IPA’s in the dataset.

#Part8: KNN used for analysis IBU and ABV relationship of IPA and Ale(no IPA)

#Form Dataset with beer style IPA 
main_IPA <- main_clean[grep("IPA",main_clean$beer_style),]
#Exclude the "Ale" part within IPA subset
main_IPA <- main_IPA[!grepl("Ale", main_IPA$beer_style),]

print(head(main_IPA))
##    brewery_id      beer_name beer_id   abv ibu                     beer_style
## 1           1   Get Together    2692 0.045  50                   American IPA
## 7           2 Citra Ass Down    2686 0.080  68 American Double / Imperial IPA
## 15          2    Rico Sauvin    2678 0.076  68 American Double / Imperial IPA
## 18          2   Pile of Face    2675 0.060  65                   American IPA
## 25          4 Habitus (2014)    2668 0.080 100 American Double / Imperial IPA
## 26          4          Solis    2667 0.075  85                   American IPA
##    serving_ounces              brewery_name        city state
## 1              16        NorthGate Brewing  Minneapolis    MN
## 7              16 Against the Grain Brewery  Louisville    KY
## 15             16 Against the Grain Brewery  Louisville    KY
## 18             16 Against the Grain Brewery  Louisville    KY
## 25             16 Mike Hess Brewing Company   San Diego    CA
## 26             16 Mike Hess Brewing Company   San Diego    CA
#Form Dataset with beer style Ale 
main_Ale <- main_clean[grep("Ale",main_clean$beer_style),]
#Exclude the "IPA" part within Ale subset
main_Ale <- main_Ale[!grepl("IPA", main_Ale$beer_style),]

print(head(main_Ale))
##    brewery_id        beer_name beer_id   abv ibu              beer_style
## 3           1       Wall's End    2690 0.048  19       English Brown Ale
## 4           1          Pumpion    2689 0.060  38             Pumpkin Ale
## 10          2           A Beer    2683 0.042  42 American Pale Ale (APA)
## 12          2    Flesh Gourd'n    2681 0.066  21             Pumpkin Ale
## 13          2         Sho'nuff    2680 0.040  13        Belgian Pale Ale
## 16          2 Coq de la Marche    2677 0.051  38  Saison / Farmhouse Ale
##    serving_ounces              brewery_name        city state
## 3              16        NorthGate Brewing  Minneapolis    MN
## 4              16        NorthGate Brewing  Minneapolis    MN
## 10             16 Against the Grain Brewery  Louisville    KY
## 12             16 Against the Grain Brewery  Louisville    KY
## 13             16 Against the Grain Brewery  Louisville    KY
## 16             16 Against the Grain Brewery  Louisville    KY

Next, we will plot each seperate classification by their associated parameters in order to identify correlation between ABV and IBU for each group.

#Scatterplot of IPA and Ale datasets
Diagram6 <- main_IPA %>% ggplot(mapping = aes(ibu, abv)) + geom_point(colour = bar_color, na.rm=TRUE)+geom_smooth(method=lm, se=FALSE, na.rm=TRUE, colour="white") + ggtitle("IPA: ABV vs. IBU")
Diagram7 <- main_Ale %>% ggplot(mapping = aes(ibu, abv)) + geom_point(color = bar_color, na.rm=TRUE)+geom_smooth(method=lm, se=FALSE, na.rm=TRUE, colour="white") + ggtitle("Ale: ABV vs. IBU")

ggplotly(Diagram6)
## `geom_smooth()` using formula 'y ~ x'
ggplotly(Diagram7)
## `geom_smooth()` using formula 'y ~ x'

Now, we will rejoin the two groups together to continue to the modeling phase

#Select cols in IPA and Ale datasets and set group name
#Group numbers: Ale--1, IPA--2
test_Ale <- main_Ale %>% select(ibu,abv)
test_Ale$flag = "ALE"
test_IPA <- main_IPA %>% select(ibu,abv)
test_IPA$flag = "IPA"
test_4KNN <- rbind(test_Ale, test_IPA)  #Combine two subsets into one for later classification
test_4KNN$flag <- as.factor(test_4KNN$flag)

To further help with visualizing the IBU and ABV differences between IPA and non-IPA beers, we plotted the following scatter, colored by beer classification. You can see below how there are noticeable clusters being formed between IPA and non-IPA beers with the limit being somewhere around 50 IBU.

#Plot IPA and Ale datasets together in a single scatterplot
Diagram7_1 <- test_4KNN %>% ggplot(mapping = aes(ibu, abv,color=test_4KNN$flag)) + geom_point(na.rm=TRUE) + geom_smooth(method=lm, se=FALSE, na.rm=TRUE, linetype="dashed") + scale_colour_manual(name="Beer Style", values=c("ALE" = "#FF652F","IPA" ="#14A76C")) + ggtitle("ABV vs. IBU by Beer Style: KNN Test")
ggplotly(Diagram7_1)
## Warning: Use of `test_4KNN$flag` is discouraged. Use `flag` instead.

## Warning: Use of `test_4KNN$flag` is discouraged. Use `flag` instead.
## `geom_smooth()` using formula 'y ~ x'

Next, we will split our data set into train-split subsets. We have specified that 70% will be used as training set, and 30% will be used for testing. Additionally, the seed has been set to 123 for reproducibility.

#Divide dataset into train and test 
set.seed(123)
trainIndex = sample(seq(1:937), 650)
trainBeers = test_4KNN[trainIndex,]
testBeers = test_4KNN[-trainIndex,] 

#Train Data Visualization
Diagram7_2 <- trainBeers %>% ggplot(aes(x = abv,fill = flag)) + geom_histogram() + facet_grid(rows = vars(flag))
Diagram7_3 <- trainBeers %>% ggplot(aes(x = ibu,fill = flag)) + geom_histogram() + facet_grid(rows = vars(flag))

ggplotly(Diagram7_2)
ggplotly(Diagram7_3)

Examining the KNN training data for correlation to ensure that we selected proper classification parameters. We see noticeable differences between the distributions of both groups, this is a sign for the model.

trainBeers$flag = as.factor(trainBeers$flag)
#Data set to get the Distribution of Ale, IPA and IBU data
Diagram7_4 <- trainBeers %>% select(abv, ibu, flag) %>% ggpairs(aes(color = flag)) + ggtitle("Distribution of Ale and IPA IBU Data")
ggplotly(Diagram7_4)
## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?
## Warning: Can only have one: highlight
## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

## Warning: All elements of `...` must be named.
## Did you want `key = c(key)`?

Below is us actually running the KNN model. We will add commentary after running an additional function for aesthetic appeal.

##Classification Method 1: KNN
#Train the model (k=5)
classifications = knn(trainBeers[c(1,2)],testBeers[c(1,2)],trainBeers$flag, prob = TRUE, k = 5)

#The resulting confusion matrix
CM_k5 = confusionMatrix(table(testBeers[,'flag'],classifications))

Below is a function for creating an aesthetically appealing confusion matrix

draw_confusion_matrix <- function(cm) {
  
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)
  
  # create the matrix 
  rect(150, 430, 240, 370, col='#00b300')
  text(195, 435, 'Ale', cex=1.2)
  rect(250, 430, 340, 370, col='#ff639c')
  text(295, 435, 'IPA', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col='#ff639c')
  rect(250, 305, 340, 365, col='#00b300')
  text(140, 400, 'Ale', cex=1.2, srt=90)
  text(140, 335, 'IPA', cex=1.2, srt=90)
  
  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')
  
  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
  
  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
} 

Below is the initial KNN training model. You will see that the model does a pretty good job at classifying between the two (definitely better than a straight guess). Additionally, it appears to perform specificity and sensitivity are very similar suggesting that the model is likely very balanced in it’s estimations.

KNN: K=5

draw_confusion_matrix(CM_k5)

Now let’s run the same model with k=10.

The above is based on k-value is five…what about k is equal to ten? You can see here that the accuracy is the same, with the increased specificity offsetting a decrease in sensitivity.

KNN: K=10

#Alternative K to train the model (k=10)
classifications_k10 = knn(trainBeers[c(1,2)],testBeers[c(1,2)],trainBeers$flag, prob = TRUE, k = 10)

#The resulting confusion matrix
CM_k10 = confusionMatrix(table(testBeers[,'flag'],classifications_k10))
draw_confusion_matrix(CM_k10)

In order to properly train our hyperparameter, k-value, we will perform an interactive loop with 100 random samples and 90 different k-values. From there, we calculate the average model accuracy in order to choose the best k-value.

##################################################################
# Loop for many k and the average of many training / test partition

set.seed(1)
iterations = 100
numks = 90
splitPerc = .7

masterAcc = matrix(nrow = iterations, ncol = numks)

for(j in 1:iterations)
{
  trainIndices = sample(1:dim(test_4KNN)[1],round(splitPerc * dim(test_4KNN)[1]))
  train = test_4KNN[trainIndices,]
  test = test_4KNN[-trainIndices,]
  for(i in 1:numks)
  {
    classifications = knn(train[,c(1,2)],test[,c(1,2)],train$flag, prob = TRUE, k = i)
    table(classifications,test$flag)
    CM = confusionMatrix(table(classifications,test$flag))
    masterAcc[j,i] = CM$overall[1]
  }
  
}

MeanAcc = colMeans(masterAcc)

plot(seq(1,numks,1),MeanAcc, type = "l")

k_opt <- which.max(MeanAcc)
max_acc <- round(max(MeanAcc),4)*100

Based on 100 random samples and 90 different hyperparameter values, our model is capable of up to 85.7% accuracy using the 5 closest points on the scatter plot for classification. This will be a very good way to test whether the new Budweiser IPA meets the requirements for what’s considered an IPA by the craft beer market. How about other popular methods? Let’s try some different models to see if any have any significantly better results.

Naive Bayes

# now let's do some comparison with other machine learning methods
## Classification Method 2: Naive Bayes
classifications_nb = naiveBayes(trainBeers[,c(1,2)],as.factor(trainBeers$flag))
CM_nb = confusionMatrix(table(predict(classifications_nb,testBeers[,c(1,2)]),as.factor(testBeers$flag)))
draw_confusion_matrix(CM_nb)

Support Vector Machine

## Classification Method 3: SVM
classifications_svm = svm(trainBeers[,c(1,2)],as.factor(trainBeers$flag))
CM_svm = confusionMatrix(table(predict(classifications_svm,testBeers[,c(1,2)]),testBeers$flag))
draw_confusion_matrix(CM_svm)

Random Forest

## Classification Method 4: RandomForest
classifications_rf = randomForest(trainBeers[,c(1,2)],as.factor(trainBeers$flag),ntree=500)
CM_rf = confusionMatrix(table(predict(classifications_rf,testBeers[,c(1,2)]),testBeers$flag))
draw_confusion_matrix(CM_rf)

Conclusion: You can see that although some of the other methods might have had a slight increase in accuracy, they all had pretty similar results from a practical standpoint (+/- 3%). Therefore, our decision to proceed with a KNN model seems justified, however, we could adjust to another model with the same parameters and expect to maintain similar accuracy.

Part 9: IPA per 100k population (additional exploration).

Now that we have created a way to classify and test our new IPA, we need to investigate the best market to perform product trials.

Initially, we will use the grep function to tally only IPA values and then filter our dataset accordingly.

# Part 9: Additional Data Exploration
# Plan - to count % IPA per total craft beers in each state.

# finding the beers containing "IPA" in their style and tallying
cat_freq <- main %>% mutate(category=
  case_when(
    grepl("IPA", beer_style) ~ "IPA",
    TRUE ~ "OTHER"
  )
) %>% group_by(state, category) %>% summarise(n=n()) %>% mutate(freq = n/sum(n)) %>% filter(category=="IPA")
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
cat_freq$state <- trimws(cat_freq$state)

# adding full state names from merging with us_map() library function
cat_freq <- merge(distinct(us_map(), full, abbr), cat_freq, by.x="abbr", by.y="state", all.x=TRUE) %>% mutate(n=replace_na(n, 0), freq=replace_na(freq, 0))

Next, let’s calculate our relative frequency of IPA to total craft beers per state.

# create percentage from frequency
cat_freq$freq = round(cat_freq$freq, digits=4)*100

After realizing the impact of population on market potential, I downloaded the 2019 consensus data in order to perform a “correction” for state population. My goal was to the best balance of IPA count per 100k consumers in order to find the best markets to perform trials before scaling to mass production.

# creatig dataframe from state consensus
state_pop <- data.frame(read.csv(here("project_data_files", "2019 consensus data.csv")))

# merging dataframe with %IPA per state dataframe
normalized_IPA <- merge(state_pop, cat_freq, by.x="State", by.y="full")

# normalizing IPA count per 100k state population
normalized_IPA <- normalized_IPA %>% mutate(IPA_per_100k = n/(Pop / 100000)) %>% arrange(desc(IPA_per_100k))

# creating a hover tag for the map
normalized_IPA$hover <- with(normalized_IPA, paste(State, '<br>', "IPA Beers", n, "<br>","IPA % of total craft beers", freq,"%", "<br>", "Population Rank", rank, "<br>", "IPA per 100k", IPA_per_100k))

Based on my calculations, you’ll see the following states are the best individual candidates for a trial run of Budeweiser IPA, according to 2019 state population and IPA interest.

# displaying a table of top 5 markets for reference
Diagram8 <- normalized_IPA %>% select(State, Pop, IPA_per_100k) %>% top_n(8)
Diagram8 %>% plot_ly(type='table',
                      header = list(
                        values = c("State Name", sprintf("State Population (%s Sample Size)",sum(Diagram8$Pop)), "IPA per 100k People")),
                      cells=list(
                        values=t(unname(as.matrix(Diagram8))),
                        font = list(color = c('#f50f07'), size = 14)
                      ))

Now that we have normalized our dataset, let’s identify the region with the best potential for the IPA release. You can see from the results below, that if we were to combine the state IPA data to a particular region, it appears that the Pacific Northwest may be the best market to enter with hard initial avoidance in the Southeast Region. Because of the geographic launch to the West Coast, my recommendation is to launch a Budweiser West Coast IPA along the Pacific Northwest for trials before pushing the product eastward. Before eastward expansion, it may prove beneficial to test the product in Vermont and trickle downward, hitting the southeast region once the product has been adequately developed for mass production.

# displaying all US India Pale Ale markets on map
Diagram9 <- plot_geo(normalized_IPA, locationmode = 'USA-states') %>%
  add_trace(
    z = ~IPA_per_100k, text=~hover, locations = ~abbr,
    color = ~IPA_per_100k,
    marker = list(line = list(color = toRGB(bg_color), width = 2.25)),colorscale='MAGMA'
  ) %>%
  colorbar(title = "IPA") %>%
  layout(
    title = 'India Pale Ale count per 100k',
    font = list(color = 'yellow'),
    geo=list(
      scope = 'usa',
      projection = list(type = 'albers usa'),
      showlakes = FALSE,
      bgcolor=toRGB(bg_color, alpha = 1)),
    paper_bgcolor=toRGB(bg_color, alpha = 1),
    margin=list(l=10, r=30, t=100, b=30)
  )

Diagram9